Scalable Analytics Model Calibration with Online Aggregation

Authors

  • Florin Rusu
  • Chengjie Qin
  • Martin Torres
Abstract

Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination, even by experienced data scientists. We argue that the lack of support for quickly identifying sub-optimal configurations is the principal cause. In this paper, we apply parallel online aggregation to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to stop the execution accurately and in a timely manner. The end result is online approximate gradient descent, a novel optimization method for scalable model calibration. We show how online approximate gradient descent can be represented as generic database aggregation and implement the resulting solution in GLADE, a state-of-the-art Big Data analytics system.
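The idea of halting sub-optimal configurations early can be illustrated with a minimal single-machine sketch. It is not the estimator from the paper and does not use any GLADE API: the function name `prune_configurations`, the normal-approximation confidence interval, and the toy linear model are all assumptions made for illustration. Each configuration's objective is estimated as a running mean of per-example losses over an incrementally growing random sample, and a configuration is dropped once its lower confidence bound exceeds the best upper bound.

```python
import math
import random


def prune_configurations(data, configs, loss, chunk_size=100, z=1.96, seed=0):
    """Estimate each configuration's mean loss from a growing random sample
    and drop configurations that are confidently worse than the best one.

    Hypothetical helper for illustration only (not the paper's estimator).
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)  # random order, so every prefix is a uniform sample

    # Per configuration: [count, running mean, sum of squared deviations]
    stats = {name: [0, 0.0, 0.0] for name in configs}
    active = set(configs)
    seen = 0

    for start in range(0, len(order), chunk_size):
        chunk = [data[i] for i in order[start:start + chunk_size]]
        for name in active:
            s = stats[name]
            for x, y in chunk:
                value = loss(configs[name], x, y)
                # Welford's online update of mean and variance
                s[0] += 1
                delta = value - s[1]
                s[1] += delta / s[0]
                s[2] += delta * (value - s[1])
        seen += len(chunk)

        # Normal-approximation confidence interval around each estimate
        bounds = {}
        for name in active:
            n, mean, m2 = stats[name]
            half = z * math.sqrt(m2 / (n - 1) / n) if n > 1 else math.inf
            bounds[name] = (mean - half, mean + half)

        # Halting condition: keep only configurations that may still be best
        best_upper = min(hi for _, hi in bounds.values())
        active = {name for name in active if bounds[name][0] <= best_upper}
        if len(active) == 1:
            break

    return active, stats, seen


def squared_error(w, x, y):
    """Per-example loss of the toy model y = w * x."""
    return (w * x - y) ** 2


# Toy data generated from y = 2x plus small noise (assumed for the demo)
gen = random.Random(42)
data = []
for _ in range(10_000):
    x = gen.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + gen.gauss(0.0, 0.1)))

configs = {"w=0.0": 0.0, "w=2.0": 2.0, "w=5.0": 5.0}
active, stats, seen = prune_configurations(data, configs, squared_error)
print(active, seen)
```

In this sketch the surviving configuration is decided after only a small fraction of the data has been scanned, which is the effect the abstract describes. A real system would also need to account for testing many configurations at once and for sampling without replacement.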

Similar resources

Speculative Approximations for Terascale Analytics

Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurat...

GLADE-ML: A Database For Big Data Analytics

Big Data Analytics has been a hot topic in computing systems, and various systems have emerged to better support it. Though databases have been the data hub for decades, they fall short of Big Data Analytics due to inherent limitations. This dissertation presents GLADE-ML, a scalable and efficient parallel database that is specifically tailored for Big Data Analytics. Different from...

Towards Security in Distributed Home System

Today, personal data analytics and privacy face a dichotomy: application authors and service providers require scalable analytics systems, while users and regulators increasingly demand applications that respect individuals' privacy. In this paper we propose to use VPN to solve new security challenges in a distributed home network system. On a prototype implementation, our initial ...

Spatial Online Sampling and Aggregation

The massive adoption of smartphones and other mobile devices has generated a huge amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especiall...

Scalable Social Analytics for Online Communities

With the constantly growing ecosphere of online communities, their managers, operators and members can benefit hugely from a rich set of tools to successfully understand, control, exploit and utilise them. This requires extracting reusable, interpretable analytics in real time from the streams of dynamically, socially produced data. In this article, we summarise our efforts in the context of th...

Journal title:
  • IEEE Data Eng. Bull.

Volume 38  Issue 

Pages  -

Publication date 2015